431 Commits

Author SHA1 Message Date
Slava Zakharin
711419e302
[flang] Enable loop-versioning for slices. (#120344)
Loops resulting from array expressions like array(:,i)
may be versioned for the unit stride of the innermost dimension,
when the initial array is an assumed-shape array (such arrays are contiguous
in many Fortran programs).
This speeds up facerec by about 12% due to further vectorization
of the innermost loop produced for the total SUM reduction.
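
As a rough illustration of what versioning on the innermost stride means, the sketch below keeps two copies of a sum loop and picks one with a runtime check. It is a hand-written C++ analogy, not the code flang generates; the function name `sumSlice` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration of loop versioning; not flang's generated code.
// Sums n elements starting at base, where consecutive elements are `stride`
// elements apart (the stride is counted in elements, not bytes).
double sumSlice(const double *base, std::size_t n, std::ptrdiff_t stride) {
  assert(base != nullptr || n == 0);
  double total = 0.0;
  if (stride == 1) {
    // Fast version: the unit stride is known inside this branch, so the
    // compiler is free to vectorize the loop with contiguous loads.
    for (std::size_t i = 0; i < n; ++i)
      total += base[i];
  } else {
    // Fallback version: general stride, kept for non-contiguous slices.
    for (std::size_t i = 0; i < n; ++i)
      total += base[static_cast<std::ptrdiff_t>(i) * stride];
  }
  return total;
}
```

Both branches compute the same result; the only difference is how much the compiler knows about the memory layout in each copy.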
2024-12-23 07:53:10 -08:00
Kazu Hirata
392651a7ec
[flang] Migrate away from PointerUnion::{is,get} (NFC) (#120880)
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:

  // FIXME: Replace the uses of is(), get() and dyn_cast() with
  //        isa<T>, cast<T> and the llvm::dyn_cast<T>

I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.
2024-12-22 13:30:16 -08:00
Valentin Clement (バレンタイン クレメン)
415cfaf339
[flang][cuda][NFC] Fix type in CUFFreeDescriptor (#120799) 2024-12-20 14:43:12 -08:00
Valentin Clement (バレンタイン クレメン)
e650ac1654
[flang][cuda][NFC] Fix typo in CUFAllocDescriptor (#120797)
Missing `r` in the function name.
2024-12-20 13:57:47 -08:00
Jacques Pienaar
09dfc5713d
[mlir] Enable decoupling two kinds of greedy behavior. (#104649)
The greedy rewriter is used in many different flows and it offers a lot of
convenience (work list management, debugging actions, tracing, etc.). But
it combines two kinds of greedy behavior: 1) how ops are matched, and 2)
folding wherever it can.

These are independent forms of greediness, and combining them leads to
inefficiency, e.g., in cases where one needs to create different phases in
lowering and is required to apply patterns in a specific order split across
different passes. Using the driver, one ends up needlessly retrying folding
or having multiple rounds of folding attempts where one final run would
have sufficed.

Of course folks can locally avoid this behavior by just building their
own, but this is also a commonly requested feature that folks keep
working around locally in suboptimal ways.

For downstream users, there should be no behavioral change. Updating
from the deprecated API should just be a find and replace (e.g., `find ./
-type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety)
as the API arguments haven't changed between the two.
2024-12-20 08:15:48 -08:00
Valentin Clement (バレンタイン クレメン)
e93d226664
[flang][cuda] Update CompilerGeneratedNames pass to work on gpu module (#120660)
- Update `CompilerGeneratedNames` so it can perform renaming in
gpu.module
- Update Codegen so it looks in the correct module for the type
descriptor.
2024-12-19 19:07:00 -08:00
Valentin Clement (バレンタイン クレメン)
37978c466b
[flang][cuda] Remove unused variable 2024-12-12 15:04:16 -08:00
Valentin Clement (バレンタイン クレメン)
ea04148c27
[flang][cuda] Extend implicit global handling to any type descriptor (#119769)
Relax the check to also handle other type descriptor globals.
2024-12-12 14:52:49 -08:00
Valentin Clement (バレンタイン クレメン)
956d0dd624
[flang][cuda] Support builtin global in device global pass (#119626) 2024-12-11 17:09:56 -08:00
khaki3
609899f443
[flang][cuda] Avoid stack corruption when setting kernel launch parameters (#119469)
In order to get the pointer to a structure member, `getelementptr`
typically requires two indices: one to indicate the structure itself,
and another to specify the member's position. We are missing the former
in `GPULaunchKernelConversion`, so generated code may cause stack
corruption. This PR corrects the indices of a structure used as a kernel
launch temp.
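
The difference between the one-index and two-index forms can be sketched with plain byte offsets. This is a minimal C++ analogy under the assumption that the temporary has the layout `(i64, i64, i32, ptr)` seen elsewhere in this log; `LaunchTemp` and both helper functions are hypothetical names, not part of flang.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a temporary of type
// !llvm.struct<(i64, i64, i32, ptr)>.
struct LaunchTemp {
  std::int64_t a;
  std::int64_t b;
  std::int32_t c;
  void *d;
};

// Byte offset reached when only one index is given: the index steps over
// whole structs, as if the pointer addressed an array of LaunchTemp.
std::size_t oneIndexOffset(std::size_t index) {
  return index * sizeof(LaunchTemp);
}

// Byte offset reached with two indices [0, member]: the first index selects
// the struct itself, the second selects a member inside it.
std::size_t twoIndexOffset(std::size_t member) {
  assert(member < 4);
  const std::size_t offsets[4] = {
      offsetof(LaunchTemp, a), offsetof(LaunchTemp, b),
      offsetof(LaunchTemp, c), offsetof(LaunchTemp, d)};
  return offsets[member];
}
```

With a single index, any value other than 0 already points past the one struct that was allocated, which is consistent with the stack corruption described above.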
2024-12-10 16:08:22 -08:00
Valentin Clement (バレンタイン クレメン)
850c932f05
[flang][cuda] Walk through cuf kernel for implicit globals (#119455)
Globals used in a cuf kernel need to be flagged as well.
2024-12-10 14:01:53 -08:00
khaki3
e9866d5d14
[flang][cuda] Fix GPULaunchKernelConversion to generate correct kernel launch parameters (#119431)
For the call to _FortranACUFLaunchKernel, we store the pointer to a
member of a temporary structure in a parameter array. However, when we
obtain an element pointer from the parameter array, its address is
calculated based on the type of the structure. This PR properly treats
the parameter array as an array of pointers.

Example:

```mlir
%30 = llvm.load %29 : !llvm.ptr -> i32
%31 = llvm.mlir.constant(1 : i32) : i32
%32 = llvm.alloca %31 x !llvm.struct<(i64, i64, i32, ptr)> : (i32) -> !llvm.ptr
%33 = llvm.mlir.constant(4 : i32) : i32
%34 = llvm.alloca %33 x !llvm.ptr : (i32) -> !llvm.ptr
%35 = llvm.mlir.constant(0 : i32) : i32
%36 = llvm.getelementptr %32[%35] : (!llvm.ptr, i32) -> !llvm.ptr, !llvm.struct<(i64, i64, i32, ptr)>
llvm.store %8, %36 : i64, !llvm.ptr
%37 = llvm.getelementptr %34[%35] : (!llvm.ptr, i32) -> !llvm.ptr, !llvm.struct<(i64, i64, i32, ptr)>
llvm.store %36, %37 : !llvm.ptr, !llvm.ptr
...
llvm.call @_FortranACUFLaunchKernel(%47, %8, %8, %8, %2, %8, %8, %7, %34, %48) : (!llvm.ptr, i64, i64, i64, i64, i64, i64, i32, !llvm.ptr, !llvm.ptr) -> () 
```
In this example, `%37 = llvm.getelementptr %34[%35] : (!llvm.ptr, i32)
-> !llvm.ptr, !llvm.struct<(i64, i64, i32, ptr)>` will be `%37 =
llvm.getelementptr %34[%35] : (!llvm.ptr, i32) -> !llvm.ptr, !llvm.ptr`.
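
In C++ terms, the fixed behavior corresponds roughly to filling an array of pointers, where each slot is one pointer wide regardless of the size of the structure the pointers refer to. The sketch below is only an analogy; `KernelArgs`, `kNumArgs`, and `fillParams` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the temporary !llvm.struct<(i64, i64, i32, ptr)>.
struct KernelArgs {
  std::int64_t a;
  std::int64_t b;
  std::int32_t c;
  void *d;
};

constexpr std::size_t kNumArgs = 4;

// Stores the address of each member in the parameter array. Each slot is one
// pointer wide, so slot i lives at byte offset i * sizeof(void *), not at
// i * sizeof(KernelArgs).
void fillParams(KernelArgs &args, void *params[], std::size_t count) {
  assert(count >= kNumArgs);
  params[0] = &args.a;
  params[1] = &args.b;
  params[2] = &args.c;
  params[3] = &args.d;
}
```

Indexing the array with the struct as the element type would instead advance by `sizeof(KernelArgs)` per slot and leave the four-pointer allocation after the first element.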
2024-12-10 11:32:32 -08:00
Yusuke MINATO
a88677edc0
Reland "[flang] Integrate the option -flang-experimental-integer-overflow into -fno-wrapv" (#118933)
This relands #110063.
The performance issue on 503.bwaves_r is found not to be related to the
patch, and is resolved by fbd89bcc when LTO is enabled.
2024-12-10 16:26:53 +09:00
Valentin Clement (バレンタイン クレメン)
a1d71c3693
[flang][cuda] Additional update to ExternalNameConversion (#119276) 2024-12-09 17:39:51 -08:00
Valentin Clement (バレンタイン クレメン)
75623bfe1b
[flang][cuda] Handle gpu.return in AbstractResult pass (#119035) 2024-12-09 17:39:16 -08:00
Valentin Clement (バレンタイン クレメン)
1d4b5c161f
[flang][cuda] Change how abstract result pass is scheduled on func.func and gpu.func (#119034)
Use `pm.nest` to schedule the pass on nested `func.func` and `gpu.func`
in the `gpu.module`.

AbstractResult pass is not meant to run on the whole gpu.module at once.
2024-12-09 13:31:27 -08:00
Renaud Kauffmann
27e458c8cb
[flang][cuda] Distinguish constant fir.global from globals with a #cuf.cuda<constant> attribute (#118912)
1. In `CufOpConversion`, `isDeviceGlobal` was renamed to
`isRegisteredGlobal` and moved to the common file. `isRegisteredGlobal`
excludes constant `fir.global` operations from registration. This is to
avoid calls to `_FortranACUFGetDeviceAddress` on globals which do not
have any symbols in the runtime. This was done for
`_FortranACUFRegisterVariable` in #118582, but also needs to be done
here after #118591.
2. `CufDeviceGlobal` no longer adds the `#cuf.cuda<constant>` attribute
to the constant global. As discussed in #118582, a module variable with
the `#cuf.cuda<constant>` attribute is not a compile-time constant. Yet,
the compile-time constant also needs to be copied into the GPU module.
The candidates for copying to the GPU module are:
   - the globals needing registration regardless of their uses in device
     code (they can be referred to in host code as well)
   - the compile-time constants when used in device code
3. The registration of "constant" module device variables
(`#cuf.cuda<constant>`) can be restored in `CufAddConstructor`.
2024-12-05 18:36:48 -08:00
Valentin Clement (バレンタイン クレメン)
7efd6139f2
[flang][cuda] Get device address in fir.declare (#118591)
Add a pattern that updates the fir.declare memref when it comes from a
device global and is not a descriptor. In that case, we recover the device
address that needs to be used in ops like `fir.array_coor` and so on.
2024-12-04 13:36:58 -08:00
Renaud Kauffmann
ed2db3be61
[flang][cuda] Do not register global constants (#118582)
Global constants have no symbols in library files. They are replaced
with literal constants during lowering before kernels are moved into a
GPU module. Do not register them because they will result in unresolved
symbols.
2024-12-04 09:37:08 -08:00
Valentin Clement (バレンタイン クレメン)
5522d2462e
[flang][cuda] Allow AbstractResult to run in gpu.module (#118529)
In CUDA Fortran, device functions are converted to `gpu.func` inside the
`gpu.module` operation. Update the AbstractResult pass to be able to run
on `func.func` and `gpu.func` operations inside the `gpu.module`.
2024-12-03 14:04:49 -08:00
s-watanabe314
f3cf24fcc4
[flang] Apply nocapture attribute to dummy arguments (#116182)
Apply the llvm.nocapture attribute to dummy arguments that do not have the
target, asynchronous, volatile, or pointer attributes in a procedure
that is not bind(c). This was discussed in
https://discourse.llvm.org/t/applying-the-nocapture-attribute-to-reference-passed-arguments-in-fortran-subroutines/81401
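
The rule stated above can be written down as a small predicate. This is a sketch of the condition only, assuming a simplified set of boolean flags; `DummyArgAttrs` and `mayMarkNoCapture` are hypothetical names and do not mirror flang's actual data structures.

```cpp
#include <cassert>

// Hypothetical flags describing a dummy argument; flang's real
// representation differs.
struct DummyArgAttrs {
  bool isTarget = false;
  bool isAsynchronous = false;
  bool isVolatile = false;
  bool isPointer = false;
};

// Returns true when, per the rule in this commit message, the argument may
// be marked nocapture: none of the four attributes is present and the
// enclosing procedure is not bind(c).
bool mayMarkNoCapture(const DummyArgAttrs &arg, bool procedureIsBindC) {
  if (procedureIsBindC)
    return false;
  return !(arg.isTarget || arg.isAsynchronous || arg.isVolatile ||
           arg.isPointer);
}
```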
2024-11-28 15:39:26 +09:00
Valentin Clement (バレンタイン クレメン)
b5825963f0
[flang][cuda] Materialize box when needed (#117810)
Materialize the box when the src comes from an embox or rebox operation.
This was done in the case of transfer to a descriptor but not when
transferring from a descriptor.
2024-11-26 17:36:25 -08:00
jeanPerier
cf602b95d1
[flang] handle fir.call in AliasAnalysis::getModRef (#117164)
fir.call side effects are hard to describe in a useful way using
`MemoryEffectOpInterface` because it is impossible to list which memory
locations a user procedure reads or writes without doing a data flow
analysis of its body (even PURE procedures may read from any module
variable; Fortran SIMPLE procedures from F2023 will allow that, but they
are far from common at this point).

Fortran language specifications allow the compiler to deduce
that a procedure call cannot access a variable in many cases.
This patch leverages this to extend `fir::AliasAnalysis::getModRef` to
deal with fir.call.

This will allow implementing "array = array_function()" optimization in
a future patch.
2024-11-26 11:17:33 +01:00
Valentin Clement (バレンタイン クレメン)
eb5cda480d
[flang][cuda] cuf.allocate: Carry over stream to the runtime call (#117631)
- Update the runtime entry points to accept stream information
- Update the conversion of `cuf.allocate` to correctly pass the stream
information when present.

Note that the stream is not currently used in the runtime. This will be
done in a separate patch as a design/solution needs to be done together
with the allocators.
2024-11-25 20:46:24 -08:00
Valentin Clement (バレンタイン クレメン)
5802367ddb
[flang][cuda] Add support for allocate with source (#117388)
Add support for allocate statement with CUDA device variable and a
source.
2024-11-22 16:55:26 -08:00
Valentin Clement (バレンタイン クレメン)
a76609dd72
[flang][cuda] Avoid intrinsics simplification in device context (#117026) 2024-11-21 10:37:38 -08:00
Valentin Clement (バレンタイン クレメン)
ecda14069f
[flang][cuda] Adapt ExternalNameConversion to work in gpu module (#117039) 2024-11-20 15:30:05 -08:00
Valentin Clement (バレンタイン クレメン)
01cd7ad2ba
[flang][cuda] Do not generate NVVM target attribute when creating the module (#116882)
Leave it to the `NVVMAttachTargetPass` so we can set compute capability
and features.
2024-11-19 16:55:34 -08:00
Valentin Clement (バレンタイン クレメン)
4d7df40c08
[flang][cuda] Materialize constant src in memory (#116851)
When the src of the data transfer is a constant, it needs to be
materialized in memory to be able to perform a data transfer.

```
subroutine sub1()
  real, device :: a(10)
  integer :: I

  do i = 5, 10
    a(i) = -4.0
  end do
end
```
2024-11-19 14:11:20 -08:00
Valentin Clement (バレンタイン クレメン)
ca79e12648
[flang][cuda] Handle implicit global in cuf kernel and nested statement (#116846)
Update the implicit global detection by looking for them in the CUF
kernel, and also switch to a walk so that `fir.address_of` operations in
nested statements are also accounted for.
2024-11-19 12:38:18 -08:00
Valentin Clement (バレンタイン クレメン)
de2e270ee6
[flang][cuda] Materialize box when src or dst are rebox (#116494) 2024-11-18 09:22:12 -08:00
Abid Qadeer
030179c2cb
[flang][debug] Support ClassType. (#114809)
This PR adds handling of `ClassType`. It is treated as a pointer to
the underlying type. Note that a `ClassType`, when passed to a function,
has double indirection, so it is represented as a pointer to the type
(compared to other types, which may have a single indirection).

If `ClassType` wraps a pointer or allocatable then we take care to
generate it as PTR -> type (and not PTR -> PTR -> type).

This is how it looks in the debugger.

```
subroutine test_proc (this)
    class(test_type), intent (inout) :: this
    allocate (this%b (3, 2))
    call fill_array_2d (this%b)
    print *, this%a
end
```

```
(gdb) p this
$6 = (PTR TO -> ( Type test_type )) 0x2052a0
(gdb) p this%a
$7 = 0
(gdb) p this%b
$8 = ((1, 2, 3) (4, 5, 6))

```
2024-11-18 11:26:35 +00:00
Valentin Clement
42be165dde Reland '[flang][cuda] Specialize entry point for scalar to desc data transfer' 2024-11-15 19:13:55 -08:00
Valentin Clement (バレンタイン クレメン)
70b9440c88
Revert "[flang][cuda] Specialize entry point for scalar to desc data transfer" (#116458)
Reverts llvm/llvm-project#116457
2024-11-15 17:44:48 -08:00
Valentin Clement (バレンタイン クレメン)
43cb424a54
[flang][cuda] Specialize entry point for scalar to desc data transfer (#116457)
The runtime Assign function is not meant to initialize an array from a
scalar. For that we need to use DoAssignFromSource. Update the data
transfer from scalar to descriptor to use a new entry point that uses
this function underneath.
2024-11-15 17:41:23 -08:00
Valentin Clement (バレンタイン クレメン)
b1fa9d154b
[flang][cuda] Correctly embox logical constant (#116445) 2024-11-15 15:29:41 -08:00
Valentin Clement (バレンタイン クレメン)
012fad975e
[flang][cuda] Materialize the box in memory when dst is emboxed (#116320)
Similar to #116289 but for the dst.
2024-11-15 14:31:36 -08:00
Valentin Clement (バレンタイン クレメン)
e8469f1577
[flang][cuda] Add support for character type in cuf.alloc and cuf.data_transfer (#116277)
Add support for character type in bytes computation
2024-11-15 14:31:21 -08:00
Valentin Clement (バレンタイン クレメン)
98daf22638
[flang][cuda] Materialize the box in memory when src is emboxed (#116289) 2024-11-14 18:33:14 -08:00
Valentin Clement (バレンタイン クレメン)
02018cf793
[flang][cuda][NFC] Use mlir::emitError to get location (#116267)
Use `mlir::emitError` so we can get location information on error.
2024-11-14 10:32:09 -08:00
Valentin Clement (バレンタイン クレメン)
d133a3ee9d
[flang][cuda] Add conversion after CUFGetDeviceAddress to avoid issue when emboxing (#116145) 2024-11-14 09:03:15 -08:00
Valentin Clement (バレンタイン クレメン)
ec066d30e2
[flang][cuda] cuf.alloc in device context should be converted to fir.alloc (#116110)
Update `inDeviceContext` to account for the gpu.func operation.
2024-11-13 14:57:42 -08:00
Valentin Clement (バレンタイン クレメン)
e457861647
[flang][cuda] Support shape shift in data transfer op. (#115929)
When an array is declared with a non default lower bound, the declare op
`getShape` will return a `ShapeShiftOp`. This result is used in data
transfer operation to compute the number of bytes to transfer. Update
the op to support `ShapeShiftOp`.
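
The size computation can be sketched as follows, assuming a shape-shift is modeled as a list of (lower bound, extent) pairs. Only the extents contribute to the byte count; the lower bounds shift the index space without changing it. `DimBounds` and `bytesToTransfer` are hypothetical names for illustration, not the conversion's actual helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one dimension described by a shape-shift:
// a lower bound paired with an extent.
struct DimBounds {
  std::int64_t lowerBound;
  std::int64_t extent;
};

// Number of bytes covered by an array with the given dimensions. The lower
// bounds are ignored on purpose: they do not affect the element count.
std::size_t bytesToTransfer(const std::vector<DimBounds> &dims,
                            std::size_t elementSize) {
  std::size_t count = 1;
  for (const DimBounds &dim : dims) {
    assert(dim.extent >= 0);
    count *= static_cast<std::size_t>(dim.extent);
  }
  return count * elementSize;
}
```

For example, an array declared as `real :: a(-5:4, 0:2)` has extents 10 and 3, so with 4-byte elements the sketch yields 120 bytes.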
2024-11-13 11:13:19 -08:00
Valentin Clement (バレンタイン クレメン)
2583071fb4
[flang][cuda] Compute size of derived type arrays (#115914) 2024-11-12 21:23:58 -08:00
Valentin Clement (バレンタイン クレメン)
853d52b838
[flang][cuda] Support derived type in cuf.data_transfer conversion (#115557)
Support derived type in `cuf.data_transfer` conversion by computing
their size in bytes.
2024-11-12 10:05:53 -08:00
Valentin Clement (バレンタイン クレメン)
d4eb430c9e
[flang][cuda] Support derived type in cuf.alloc (#115550)
Number of bytes to allocate was not computed when using `cuf.alloc` with
a derived type. Update the conversion to compute the number of bytes and
emit an error when type is not supported.
2024-11-08 14:32:00 -08:00
Valentin Clement (バレンタイン クレメン)
ef8d88ca1a
[flang][cuda] Support scalar to array data transfer (#115273)
Do it via descriptor assignment until we have a more efficient way.
2024-11-07 09:27:10 -08:00
Valentin Clement (バレンタイン クレメン)
db69d6939a
[flang][cuda] Support data transfer from descriptor to a pointer (#115023)
Data transfer from a variable with a descriptor to a pointer. We create
a descriptor for the pointer so we can use the flang runtime to perform
the transfer. The Assign function handles all corner cases. We add a new
entry point, `CUFDataTransferDescDescNoRealloc`, to avoid reallocation
since the variable on the LHS is not an allocatable.
2024-11-05 11:59:08 -08:00
Abid Qadeer
a993dfcdbf
[flang][debug] Support assumed-rank arrays. (#114404)
Assumed-rank arrays are represented by DIGenericSubrange in debug
metadata. We have to provide two things.

1. An expression to get the rank value at runtime from the descriptor.

2. Assuming the dimension number for which we want the array information
has been put on the DWARF expression stack, expressions which will
extract the lowerBound, count and stride information from the descriptor
for the said dimension.

With this patch in place, this is how I see an assumed-rank variable
being evaluated by GDB.

```
function mean(x) result(y)
integer, intent(in) :: x(..)
...
end

program main
use mod
implicit none
integer :: x1,xvec(3),xmat(3,3),xtens(3,3,3)
x1 = 5
xvec = 6
xmat = 7
xtens = 8
print *,mean(xvec), mean(xmat), mean(xtens), mean(x1)
end program main

(gdb) p x
$1 = (6, 6, 6)

(gdb) p x
$2 = ((7, 7, 7) (7, 7, 7) (7, 7, 7))

(gdb) p x
$3 = (((8, 8, 8) (8, 8, 8) (8, 8, 8)) ((8, 8, 8) (8, 8, 8) (8, 8, 8)) ((8, 8, 8) (8, 8, 8) (8, 8, 8)))

(gdb) p x
$4 = 5
```
2024-11-05 18:49:29 +00:00
Valentin Clement (バレンタイン クレメン)
652db7e4ff
[flang][cuda] Support data transfer from pointer to a descriptor (#114892)
When the source is a pointer to an array or a scalar, embox it and use the
`CUFDataTransferDescDesc` or `CUFDataTransferGlobalDescDesc` entry
points. The runtime is already able to deal with all the corner cases,
such as non-contiguous arrays, so we exploit this.

Memset might still be used for simple cases where we want to initialize
to 0, for example. This will come in a follow-up patch.
2024-11-05 08:56:19 -08:00