Switch from `int64_t` to `int64_t*` to fit with the rest of the
implementation.
New tentative with some fix. The previous was reverted some time ago.
Reviewed in #138010
This patch updates several passes to include the DLTI dialect, since
their use of the `fir::support::getOrSetMLIRDataLayout()` utility
function could, in some cases, require this dialect to be loaded in
advance.
Also, the `CUFComputeSharedMemoryOffsetsAndSize` pass has been updated
with a dependency to the GPU dialect, as its invocation to
`cuf::getOrCreateGPUModule()` would result in the same kind of error if
no other operations or attributes from that dialect were present in the
input MLIR module.
Update `cuf.kernel_launch` to take the stream as a reference. Update the
conversion to insert the `cuf.stream_cast` op so the stream can be set
as dependency.
Currently when we do a CUDA data transfer from a constant, we embox it
and delegate the assignment to the runtime. When the type of the
constant is not exactly the same as the destination descriptor, the
runtime will emit an assignment mismatch error.
Convert the constant when necessary so the assignment is fine.
Move non-common files from FortranCommon to FortranSupport (analogous to
LLVMSupport) such that
* declarations and definitions that are only used by the Flang compiler,
but not by the runtime, are moved to FortranSupport
* declarations and definitions that are used by both ("common"), the
compiler and the runtime, remain in FortranCommon
* generic STL-like/ADT/utility classes and algorithms remain in
FortranCommon
This allows a for cleaner separation between compiler and runtime
components, which are compiled differently. For instance, runtime
sources must not use STL's `<optional>` which causes problems with CUDA
support. Instead, the surrogate header `flang/Common/optional.h` must be
used. This PR fixes this for `fast-int-sel.h`.
Declarations in include/Runtime are also used by both, but are
header-only. `ISO_Fortran_binding_wrapper.h`, a header used by compiler
and runtime, is also moved into FortranCommon.
As there is now certain areas where we now have the possibility of
having either a ModuleOp or GPUModuleOp and both of these modules can
have DataLayout's and we may require utilising the DataLayout utilities
in these areas I've taken the liberty of trying to extend them for use
with both.
Those with more knowledge of how they wish the GPUModuleOp's to interact
with their parent ModuleOp's DataLayout may have further alterations
they wish to make in the future, but for the moment, it'll simply
utilise the basic data layout construction which I believe combines
parent and child datalayouts from the ModuleOp and GPUModuleOp. If there
is no GPUModuleOp DataLayout it should default to the parent ModuleOp.
It's worth noting there is some weirdness if you have two module
operations defining builtin dialect DataLayout Entries, it appears the
combinatorial functionality for DataLayouts doesn't support the merging
of these.
This behaviour is useful for areas like:
https://github.com/llvm/llvm-project/pull/119585/files#diff-19fc4bcb38829d085e25d601d344bbd85bf7ef749ca359e348f4a7c750eae89dR1412
where we have a crossroads between the two different module operations.
Introduce a new op to get the device address from a host symbol. This
simplify the current conversion and this is also in preparation for some
legalization work that need to be done in cuf kernel and cuf kernel
launch similar to
https://github.com/llvm/llvm-project/pull/122802
Introduce cuf.sync_descriptor to be used to sync device global
descriptor after pointer association.
Also move CUFCommon so it can be used in FIRBuilder lib as well.
1. In `CufOpConversion` `isDeviceGlobal` was renamed
`isRegisteredGlobal` and moved to the common file. `isRegisteredGlobal`
excludes constant `fir.global` operation from registration. This is to
avoid calls to `_FortranACUFGetDeviceAddress` on globals which do not
have any symbols in the runtime. This was done for
`_FortranACUFRegisterVariable` in #118582, but also needs to be done
here after #118591
2. `CufDeviceGlobal` no longer adds the `#cuf.cuda<constant>` attribute
to the constant global. As discussed in #118582 a module variable with
the #cuf.cuda<constant> attribute is not a compile time constant. Yet,
the compile time constant also needs to be copied into the GPU module.
The candidates for copy to the GPU modules are
- the globals needing regsitrations regardless of their uses in device
code (they can be referred to in host code as well)
- the compile time constant when used in device code
3. The registration of "constant" module device variables (
#cuf.cuda<constant>) can be restored in `CufAddConstructor`
Add pattern that update fir.declare memref when it comes from a device
global and is not a descriptor. In that case, we recover the device
address that needs to be used in ops like `fir.array_coor` and so on.
Materialize the box when the src comes from a embox or rebox operation.
This was done in the case of transfer to a descriptor but not when
transferring from a descriptor.
- Update the runtime entry points to accept a stream information
- Update the conversion of `cuf.allocate` to pass correctly the stream
information when present.
Note that the stream is not currently used in the runtime. This will be
done in a separate patch as a design/solution needs to be down together
with the allocators.
When the src of the data transfer is a constant, it needs to be
materialized in memory to be able to perform a data transfer.
```
subroutine sub1()
real, device :: a(10)
integer :: I
do i = 5, 10
a(i) = -4.0
end do
end
```
The runtime Assign function is not meant to initialize an array from a
scalar. For that we need to use DoAssignFromSource. Update the data
transfer from scalar to descriptor to use a new entry point that use
this function underneath.
When an array is declared with a non default lower bound, the declare op
`getShape` will return a `ShapeShiftOp`. This result is used in data
transfer operation to compute the number of bytes to transfer. Update
the op to support `ShapeShiftOp`.
Number of bytes to allocate was not computed when using `cuf.alloc` with
a derived type. Update the conversion to compute the number of bytes and
emit an error when type is not supported.